Topics

  • Use simple linear regression to describe the relationship between a quantitative predictor and quantitative response variable.

  • Estimate the slope and intercept of the regression line using the least squares method.

  • Interpret the slope and intercept of the regression line.

1. Libraries

The library() function loads functionality provided by R packages that is not included in base R. install.packages() can be used to install new packages; run this command from the console, as each package only needs to be installed once.

# install.packages("tidyverse")

First, load the tidyverse package, which includes ggplot2, the package used throughout this tutorial for data visualizations.

library(tidyverse)

2. Exploratory Data Analysis

This tutorial uses the nhanes1518 dataset, whose variables are described in the file nhanes-codebook.txt. Load this data with the load() function, specifying the .rda data file.

load(file='nhanes1518.rda')

The functions head() and names() can be used to explore the data.

head(nhanes1518)
## # A tibble: 6 × 52
##    SEQN WTINT2YR WTMEC…¹ RIDST…² RIAGE…³ RIDAG…⁴ INDFM…⁵ RIDRE…⁶ DMDED…⁷ DMDED…⁸
##   <int>    <dbl>   <dbl>   <int>   <int>   <int>   <dbl>   <int>   <int>   <int>
## 1 83732  134671. 135630.       2       1      62    4.39       3       5      NA
## 2 83733   24329.  25282.       2       1      53    1.32       3       3      NA
## 3 83734   12400.  12576.       2       1      78    1.51       3       3      NA
## 4 83735  102718. 102079.       2       2      56    5          3       5      NA
## 5 83736   17628.  18235.       2       2      42    1.23       4       4      NA
## 6 83737   11252.  10879.       2       2      72    2.82       1       2      NA
## # … with 42 more variables: INDHHIN2 <int>, BMXBMI <dbl>, BMXWAIST <dbl>,
## #   BMXWT <dbl>, BMXHT <dbl>, URXUCR <dbl>, WTSB2YR <dbl>, URXCNP <dbl>,
## #   URDCNPLC <int>, URXCOP <dbl>, URDCOPLC <int>, URXECP <dbl>, URDECPLC <int>,
## #   URXHIBP <dbl>, URDHIBLC <int>, URXMBP <dbl>, URDMBPLC <int>, URXMC1 <dbl>,
## #   URDMC1LC <int>, URXMCOH <dbl>, URDMCOLC <int>, URXMEP <dbl>,
## #   URDMEPLC <int>, URXMHBP <dbl>, URDMHBLC <int>, URXMHH <dbl>,
## #   URDMHHLC <int>, URXMHNC <dbl>, URDMCHLC <int>, URXMHP <dbl>, …
names(nhanes1518)
##  [1] "SEQN"     "WTINT2YR" "WTMEC2YR" "RIDSTATR" "RIAGENDR" "RIDAGEYR"
##  [7] "INDFMPIR" "RIDRETH1" "DMDEDUC2" "DMDEDUC3" "INDHHIN2" "BMXBMI"  
## [13] "BMXWAIST" "BMXWT"    "BMXHT"    "URXUCR"   "WTSB2YR"  "URXCNP"  
## [19] "URDCNPLC" "URXCOP"   "URDCOPLC" "URXECP"   "URDECPLC" "URXHIBP" 
## [25] "URDHIBLC" "URXMBP"   "URDMBPLC" "URXMC1"   "URDMC1LC" "URXMCOH" 
## [31] "URDMCOLC" "URXMEP"   "URDMEPLC" "URXMHBP"  "URDMHBLC" "URXMHH"  
## [37] "URDMHHLC" "URXMHNC"  "URDMCHLC" "URXMHP"   "URDMHPLC" "URXMIB"  
## [43] "URDMIBLC" "URXMNP"   "URDMNPLC" "URXMOH"   "URDMOHLC" "URXMZP"  
## [49] "URDMZPLC" "WTINT4YR" "WTMEC4YR" "WTSB4YR"

We will first explore the dataset by performing exploratory data analysis on the predictor variable age (RIDAGEYR):

# observations with age = 0
nhanes1518%>%filter(RIDAGEYR==0)
## # A tibble: 753 × 52
##     SEQN WTINT…¹ WTMEC…² RIDST…³ RIAGE…⁴ RIDAG…⁵ INDFM…⁶ RIDRE…⁷ DMDED…⁸ DMDED…⁹
##    <int>   <dbl>   <dbl>   <int>   <int>   <int>   <dbl>   <int>   <int>   <int>
##  1 83765  13932.  14073.       2       2       0    3.17       3      NA      NA
##  2 83782   8735.      0        1       2       0    0.98       3      NA      NA
##  3 83798  18168.  18352.       2       2       0    5          5      NA      NA
##  4 83840   5844.   6011.       2       2       0    0.75       1      NA      NA
##  5 83852  17442.  18896.       2       2       0    5          3      NA      NA
##  6 83858   5770.   5651.       2       1       0   NA          1      NA      NA
##  7 83939  17961.  18143.       2       2       0    2.99       3      NA      NA
##  8 83942  12427.  12481.       2       1       0    5          3      NA      NA
##  9 83956  15425.  16860.       2       2       0    4.12       3      NA      NA
## 10 83957   5478.   6025.       2       2       0    0.49       1      NA      NA
## # … with 743 more rows, 42 more variables: INDHHIN2 <int>, BMXBMI <dbl>,
## #   BMXWAIST <dbl>, BMXWT <dbl>, BMXHT <dbl>, URXUCR <dbl>, WTSB2YR <dbl>,
## #   URXCNP <dbl>, URDCNPLC <int>, URXCOP <dbl>, URDCOPLC <int>, URXECP <dbl>,
## #   URDECPLC <int>, URXHIBP <dbl>, URDHIBLC <int>, URXMBP <dbl>,
## #   URDMBPLC <int>, URXMC1 <dbl>, URDMC1LC <int>, URXMCOH <dbl>,
## #   URDMCOLC <int>, URXMEP <dbl>, URDMEPLC <int>, URXMHBP <dbl>,
## #   URDMHBLC <int>, URXMHH <dbl>, URDMHHLC <int>, URXMHNC <dbl>, …

Given the number of observations with age 0, we limit our attention to adults (age 18 and older) to investigate the effect of age on BMI.

nhanes1518 <- nhanes1518%>%filter(RIDAGEYR>=18)
# Basic histogram
ggplot(nhanes1518, aes(x=RIDAGEYR)) + geom_histogram()+labs(x = "Age", title = "Distribution of Age")

# Change the width of bins
ggplot(nhanes1518, aes(x=RIDAGEYR)) + 
  geom_histogram(binwidth=1)+labs(x = "Age", title = "Distribution of Age")

# Change colors
p<-ggplot(nhanes1518, aes(x=RIDAGEYR)) + 
  geom_histogram(color="black", fill="white")+labs(x = "Age", title = "Distribution of Age")

We see that the distribution of age among adults is asymmetric, with a spike at age 80 (in NHANES, ages 80 and above are recorded as 80).

We can also add a vertical line at the mean, and overlay a transparent density plot. The value of alpha controls the level of transparency:

# Add mean line
p+geom_vline(aes(xintercept=mean(RIDAGEYR)),
            color="blue", linetype="dashed", size=1)

# Histogram with density plot
ggplot(nhanes1518, aes(x=RIDAGEYR)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#FF6666") 

3. Simple Linear Regression

We’ll start by fitting a simple linear model using the lm() function. Instead of attaching the nhanes1518 dataset, we can pass it to lm() via the data argument. In the lm() formula, the variable to the left of the ~ symbol is the response and the variable(s) to the right of it are the predictor(s). Here we use age (RIDAGEYR) as the response and BMI (BMXBMI) as the predictor.

lm.fit <- lm(RIDAGEYR ~ BMXBMI, data = nhanes1518)

There are several ways that we can examine the model results. First, we can just call the name of the lm() model for a brief summary.

lm.fit
## 
## Call:
## lm(formula = RIDAGEYR ~ BMXBMI, data = nhanes1518)
## 
## Coefficients:
## (Intercept)       BMXBMI  
##     45.4146       0.1153
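
For reference, these coefficients are the least squares estimates, which minimize the sum of squared residuals. For a simple linear regression of a response \(y\) on a single predictor \(x\), they have the closed forms:

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]

so the fitted line always passes through the point \((\bar{x}, \bar{y})\).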

We can also use the names() function to list the components stored in the lm.fit model object:

names(lm.fit)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "xlevels"       "call"          "terms"        
## [13] "model"

The summary() function gives a more extensive overview of the model fit:

summary(lm.fit)
## 
## Call:
## lm(formula = RIDAGEYR ~ BMXBMI, data = nhanes1518)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.786 -16.164   0.086  14.837  32.948 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 45.41461    0.73402  61.871  < 2e-16 ***
## BMXBMI       0.11531    0.02413   4.778 1.79e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.48 on 11094 degrees of freedom
##   (752 observations deleted due to missingness)
## Multiple R-squared:  0.002053,   Adjusted R-squared:  0.001963 
## F-statistic: 22.83 on 1 and 11094 DF,  p-value: 1.795e-06

The coefficients of the linear regression model can be extracted using the coef() function and the confidence interval(s) with the confint() function.

coef(lm.fit)
## (Intercept)      BMXBMI 
##  45.4146068   0.1153059
confint(lm.fit)
##                   2.5 %     97.5 %
## (Intercept) 43.97580422 46.8534094
## BMXBMI       0.06799999  0.1626119
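
As a sanity check, the slope reported by lm() can be reproduced from the closed-form least squares estimate: the sum of cross-products of the centered x and y values divided by the sum of squared centered x values. A minimal sketch on synthetic data (the variables x and y here are illustrative, not from nhanes1518):

```r
# Illustrative synthetic data, not the NHANES data
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)
y <- 2 + 0.5 * x + rnorm(100)

fit <- lm(y ~ x)

# Closed-form least squares estimates
slope     <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

# Agrees with lm() up to numerical precision
all.equal(unname(coef(fit)), c(intercept, slope))
```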

We can use the predict() function to obtain confidence intervals or prediction intervals for given values of the predictor variable, BMXBMI. Note that when using the predict() function, the column names and format of the new data must match those of the original data frame used to fit the lm() model. If you encounter errors using the predict() function, this is a good first thing to check.

predict(lm.fit, data.frame(BMXBMI = (c(70, 80, 90))), interval = "confidence")
##        fit      lwr      upr
## 1 53.48602 51.54110 55.43094
## 2 54.63908 52.22710 57.05106
## 3 55.79214 52.91114 58.67314
predict(lm.fit, data.frame(BMXBMI = (c(70, 80, 90))), interval = "prediction")
##        fit      lwr      upr
## 1 53.48602 17.21825 89.75379
## 2 54.63908 18.34327 90.93489
## 3 55.79214 19.46215 92.12213
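
Note that the prediction intervals above are much wider than the corresponding confidence intervals: a confidence interval captures uncertainty about the mean response at a given predictor value, while a prediction interval additionally accounts for the scatter of individual observations around that mean. A small sketch on synthetic data (x and y are illustrative):

```r
# Illustrative synthetic data
set.seed(2)
x <- rnorm(200)
y <- 1 + 3 * x + rnorm(200)
fit <- lm(y ~ x)

newdata <- data.frame(x = 0)
ci  <- predict(fit, newdata, interval = "confidence")
pri <- predict(fit, newdata, interval = "prediction")

# The prediction interval is always wider than the confidence interval
(pri[, "upr"] - pri[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"])   # TRUE
```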

We can plot the variables RIDAGEYR and BMXBMI using the plot() function and overlay the regression line found using lm() with the abline() function. Since lm.fit models RIDAGEYR as a function of BMXBMI, the predictor BMXBMI must go on the x-axis for the line to be drawn in the correct coordinates.

plot(x=nhanes1518$BMXBMI, y=nhanes1518$RIDAGEYR, xlab="Body Mass Index (kg/m**2)", ylab="Age")
abline(lm.fit)

Here the dollar sign specifies that we retrieve the columns BMXBMI and RIDAGEYR from the dataset nhanes1518.

We can experiment with different options for abline() by changing the line width and color in abline().

plot(x=nhanes1518$BMXBMI, y=nhanes1518$RIDAGEYR, xlab="Body Mass Index (kg/m**2)", ylab="Age")
abline(lm.fit, lwd = 3, col = "red")

plot(x=nhanes1518$RIDAGEYR, y=nhanes1518$BMXBMI, xlab="Age", ylab="Body Mass Index (kg/m**2)", col = "red")

The pch argument in plot() changes the shape/type of the points that are plotted.

plot(x=nhanes1518$RIDAGEYR, y=nhanes1518$BMXBMI, xlab="Age", ylab="Body Mass Index (kg/m**2)", pch = 20)

plot(x=nhanes1518$RIDAGEYR, y=nhanes1518$BMXBMI, xlab="Age", ylab="Body Mass Index (kg/m**2)", pch = "+")

Optional: We can make a similar plot using ggplot, where geom_smooth() fits and draws the regression line (here with BMXBMI as the response, since the formula y ~ x refers to the plot’s aesthetics).

library(ggplot2)
ggplot(nhanes1518, aes(x = RIDAGEYR, y = BMXBMI)) + 
    geom_smooth(method = "lm", formula = y ~ x, colour = "blue") + 
    geom_point() +
  ggtitle("Age vs. BMI for the nhanes data")

We can use the residuals() and rstudent() functions to extract the residuals and studentized residuals, respectively, from the linear model and plot them against the predicted values.

plot(predict(lm.fit), residuals(lm.fit))

plot(predict(lm.fit), rstudent(lm.fit))

Additionally, we can compute the leverage (hat) values of the observations using the hatvalues() function.

plot(hatvalues(lm.fit))

which.max(hatvalues(lm.fit))
## 8723 
## 8196
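
To be precise, hatvalues() returns the diagonal of the hat matrix \(H = X(X^{T}X)^{-1}X^{T}\), where \(X\) is the model (design) matrix; these diagonal entries are the leverage values. A sketch verifying this on synthetic data (x and y are illustrative):

```r
# Illustrative synthetic data
set.seed(3)
x <- rnorm(50)
y <- x + rnorm(50)
fit <- lm(y ~ x)

# Build the design matrix and compute H = X (X'X)^{-1} X'
X <- model.matrix(fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)

# The diagonal of H matches hatvalues(fit)
all.equal(unname(diag(H)), unname(hatvalues(fit)))
```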

4. Model Diagnostics & Interpretation

The par() function can be used to create a grid of multiple subplots.

par(mfrow = c(2, 2))
plot(lm.fit)

The diagnostic plots show residuals in four different ways:

  • Residuals vs Fitted. Used to check the linearity assumption. A roughly horizontal line without distinct patterns (such as a fan or funnel shape) indicates a linear relationship, which is good. The model we fitted shows no distinct patterns in the residuals vs. fitted plot.

  • Normal Q-Q. Used to examine whether the residuals are normally distributed. It’s good if the residual points follow the straight dashed line. Here the Q-Q plot generally follows the dashed line, with some deviations in the upper tail (at high theoretical quantiles).

  • Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity.

  • Residuals vs Leverage. Used to identify influential cases, that is, extreme values that might change the regression results when included in or excluded from the analysis. Based on the residuals vs. leverage plot, there are no influential points according to Cook’s distance. However, a few points with large standardized residuals could be flagged as outliers.

Some other metrics from the model:

  • \(R^2\): From the model above, we have an adjusted R-squared value of 0.0020, which indicates that only about 0.2% of the variability in the response variable age is explained by the predictor variable BMI.
  • p-value: The p-value tells us how likely data at least as extreme as what we observed would be if the null hypothesis were true (more material on null hypotheses in subsequent tutorials), i.e. if there were no linear association between the predictor BMI and the response age. From the model above, the p-value for BMXBMI is 1.79e-06, which tells us that the predictor is statistically significant.
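
Both quantities can be extracted programmatically from the summary object rather than read off the printed output. A sketch on synthetic data (x and y are illustrative; the same accessors work on summary(lm.fit)):

```r
# Illustrative synthetic data
set.seed(4)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
s <- summary(lm(y ~ x))

s$r.squared                       # multiple R-squared
s$adj.r.squared                   # adjusted R-squared
s$coefficients["x", "Pr(>|t|)"]   # p-value for the slope
```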

5. Multiple Linear Regression

The lm() function can also fit multiple regression models. In this section, we will use RIDAGEYR, BMXWAIST, BMXWT, and BMXHT as predictors of the response variable BMXBMI.

lm.fit <- lm(BMXBMI ~ RIDAGEYR+ BMXWAIST + BMXWT + BMXHT, data = nhanes1518)
summary(lm.fit)
## 
## Call:
## lm(formula = BMXBMI ~ RIDAGEYR + BMXWAIST + BMXWT + BMXHT, data = nhanes1518)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1176 -0.3262  0.0503  0.3471  8.7914 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 57.4364793  0.2080287   276.1  < 2e-16 ***
## RIDAGEYR    -0.0039875  0.0005317    -7.5 6.88e-14 ***
## BMXWAIST     0.0166524  0.0015277    10.9  < 2e-16 ***
## BMXWT        0.3443669  0.0012653   272.2  < 2e-16 ***
## BMXHT       -0.3463345  0.0011206  -309.1  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8401 on 10529 degrees of freedom
##   (1314 observations deleted due to missingness)
## Multiple R-squared:  0.986,  Adjusted R-squared:  0.986 
## F-statistic: 1.849e+05 on 4 and 10529 DF,  p-value: < 2.2e-16

Model Interpretation:

  • Intercept: The intercept has no practical interpretation, since it corresponds to a person with age, waist circumference, weight, and height all equal to 0, which is impossible.
  • BMXWT: The coefficient for the predictor BMXWT is 0.3444, which means that for every one-unit (kg) increase in the participant’s weight, the BMI is expected to increase by 0.3444 on average, holding all other predictor variables (RIDAGEYR, BMXWAIST, and BMXHT) constant.

In the lm() formula, a dot . can be used to include all other variables in the supplied data frame as predictors. Note that the data frame constructed below contains BMXBMI itself among its columns, so the model recovers it exactly (coefficient 1 and R-squared of 1); the example is only meant to illustrate the . syntax.

nhanes_na_removed<-cbind(nhanes1518[1:5],nhanes1518$BMXBMI)
# build a smaller data frame with the first five columns plus BMXBMI
lm.fit1 <- lm(nhanes1518$BMXBMI ~ ., data = nhanes_na_removed)
summary(lm.fit1)
## 
## Call:
## lm(formula = nhanes1518$BMXBMI ~ ., data = nhanes_na_removed)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.957e-13 -1.310e-16  2.400e-17  1.780e-16  5.827e-15 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)         -7.587e-15  4.726e-16 -1.605e+01  < 2e-16 ***
## SEQN                -6.901e-21  4.878e-21 -1.415e+00   0.1572    
## WTINT2YR            -1.533e-20  7.213e-21 -2.125e+00   0.0336 *  
## WTMEC2YR             1.481e-20  6.928e-21  2.137e+00   0.0326 *  
## RIDSTATR                    NA         NA         NA       NA    
## RIAGENDR            -2.707e-16  5.427e-17 -4.988e+00  6.2e-07 ***
## `nhanes1518$BMXBMI`  1.000e+00  3.726e-18  2.684e+17  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.842e-15 on 11090 degrees of freedom
##   (752 observations deleted due to missingness)
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.451e+34 on 5 and 11090 DF,  p-value: < 2.2e-16

If we want to exclude specific variables from the list of predictors, we can use the - notation. In the following example, all variables except RIDSTATR are included in the model.

library(dplyr)
nhanes_na_removed<-cbind(nhanes1518[1:5],nhanes1518$BMXBMI)
# build a smaller data frame with the first five columns plus BMXBMI
lm.fit1 <- lm(nhanes1518$BMXBMI ~ . - RIDSTATR, data = nhanes_na_removed)
summary(lm.fit1)
## 
## Call:
## lm(formula = nhanes1518$BMXBMI ~ . - RIDSTATR, data = nhanes_na_removed)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.957e-13 -1.310e-16  2.400e-17  1.780e-16  5.827e-15 
## 
## Coefficients:
##                       Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)         -7.587e-15  4.726e-16 -1.605e+01  < 2e-16 ***
## SEQN                -6.901e-21  4.878e-21 -1.415e+00   0.1572    
## WTINT2YR            -1.533e-20  7.213e-21 -2.125e+00   0.0336 *  
## WTMEC2YR             1.481e-20  6.928e-21  2.137e+00   0.0326 *  
## RIAGENDR            -2.707e-16  5.427e-17 -4.988e+00  6.2e-07 ***
## `nhanes1518$BMXBMI`  1.000e+00  3.726e-18  2.684e+17  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.842e-15 on 11090 degrees of freedom
##   (752 observations deleted due to missingness)
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.451e+34 on 5 and 11090 DF,  p-value: < 2.2e-16

Including -1 excludes the intercept from the model.

lm.fit1 <- lm(nhanes1518$BMXBMI ~ .- 1, data = nhanes_na_removed)
summary(lm.fit1)
## 
## Call:
## lm(formula = nhanes1518$BMXBMI ~ . - 1, data = nhanes_na_removed)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.434e-12 -3.000e-16  5.000e-16  1.300e-15  9.070e-14 
## 
## Coefficients:
##                       Estimate Std. Error    t value Pr(>|t|)    
## SEQN                -2.674e-19  8.860e-20 -3.018e+00  0.00255 ** 
## WTINT2YR            -5.118e-19  1.310e-19 -3.907e+00 9.41e-05 ***
## WTMEC2YR             4.943e-19  1.259e-19  3.928e+00 8.61e-05 ***
## RIDSTATR            -8.811e-14  4.292e-15 -2.053e+01  < 2e-16 ***
## RIAGENDR            -7.173e-15  9.858e-16 -7.277e+00 3.65e-13 ***
## `nhanes1518$BMXBMI`  1.000e+00  6.768e-17  1.478e+16  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.163e-14 on 11090 degrees of freedom
##   (752 observations deleted due to missingness)
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 6.418e+32 on 6 and 11090 DF,  p-value: < 2.2e-16

The update() function can be used to specify a new formula for an existing model.

lm.fit1 <- update(lm.fit, ~. - RIDSTATR)

6. Interaction Terms

There are two ways to include interaction terms in the model, : and *. The : symbol only includes the interaction term between the two variables, while the * symbol includes the variables themselves, as well as the interaction terms. This means that BMXWT*BMXWAIST is equivalent to BMXWT + BMXWAIST + BMXWT:BMXWAIST.

summary(lm(BMXBMI ~  BMXWT* BMXWAIST, data = nhanes1518))
## 
## Call:
## lm(formula = BMXBMI ~ BMXWT * BMXWAIST, data = nhanes1518)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6955  -1.7808  -0.2051   1.5276  20.6194 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.379e+00  4.769e-01   2.892  0.00384 ** 
## BMXWT          4.045e-02  6.577e-03   6.150 8.01e-10 ***
## BMXWAIST       1.913e-01  5.180e-03  36.925  < 2e-16 ***
## BMXWT:BMXWAIST 6.650e-04  5.117e-05  12.997  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.71 on 10530 degrees of freedom
##   (1314 observations deleted due to missingness)
## Multiple R-squared:  0.8539, Adjusted R-squared:  0.8538 
## F-statistic: 2.051e+04 on 3 and 10530 DF,  p-value: < 2.2e-16
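
The claimed equivalence of BMXWT*BMXWAIST and BMXWT + BMXWAIST + BMXWT:BMXWAIST can be verified directly: the two formulas yield identical coefficients. A sketch on synthetic data (a, b, and y are illustrative):

```r
# Illustrative synthetic data with a true interaction effect
set.seed(5)
a <- rnorm(100)
b <- rnorm(100)
y <- 1 + a + b + 2 * a * b + rnorm(100)

fit_star     <- lm(y ~ a * b)          # shorthand
fit_expanded <- lm(y ~ a + b + a:b)    # explicit main effects + interaction

# Same model, same coefficients
all.equal(coef(fit_star), coef(fit_expanded))
```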

A simple way to include all pairwise interaction terms between the variables is the syntax .^2.

library(dplyr)
nhanes_na_removed<-cbind(nhanes1518[1:5],nhanes1518$BMXBMI)
# build a smaller data frame with the first five columns plus BMXBMI
summary(lm(nhanes1518$BMXBMI ~ .^2, data = nhanes_na_removed))
## 
## Call:
## lm(formula = nhanes1518$BMXBMI ~ .^2, data = nhanes_na_removed)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.956e-13 -1.400e-16  1.900e-17  1.930e-16  5.786e-15 
## 
## Coefficients: (6 not defined because of singularities)
##                                Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)                  -5.194e-15  2.375e-15 -2.187e+00   0.0288 *  
## SEQN                         -1.453e-20  2.518e-20 -5.770e-01   0.5640    
## WTINT2YR                     -1.138e-19  1.393e-19 -8.170e-01   0.4138    
## WTMEC2YR                      1.221e-19  1.344e-19  9.090e-01   0.3634    
## RIDSTATR                             NA         NA         NA       NA    
## RIAGENDR                     -1.760e-15  9.395e-16 -1.873e+00   0.0611 .  
## `nhanes1518$BMXBMI`           1.000e+00  6.442e-17  1.552e+16   <2e-16 ***
## SEQN:WTINT2YR                 7.582e-25  1.433e-24  5.290e-01   0.5967    
## SEQN:WTMEC2YR                -8.528e-25  1.384e-24 -6.160e-01   0.5377    
## SEQN:RIDSTATR                        NA         NA         NA       NA    
## SEQN:RIAGENDR                 1.502e-20  9.817e-21  1.530e+00   0.1261    
## SEQN:`nhanes1518$BMXBMI`     -3.974e-22  6.755e-22 -5.880e-01   0.5564    
## WTINT2YR:WTMEC2YR            -4.070e-27  7.030e-27 -5.790e-01   0.5626    
## WTINT2YR:RIDSTATR                    NA         NA         NA       NA    
## WTINT2YR:RIAGENDR            -7.333e-21  1.484e-20 -4.940e-01   0.6212    
## WTINT2YR:`nhanes1518$BMXBMI`  7.194e-22  1.182e-21  6.090e-01   0.5429    
## WTMEC2YR:RIDSTATR                    NA         NA         NA       NA    
## WTMEC2YR:RIAGENDR             7.441e-21  1.429e-20  5.210e-01   0.6026    
## WTMEC2YR:`nhanes1518$BMXBMI` -7.133e-22  1.145e-21 -6.230e-01   0.5332    
## RIDSTATR:RIAGENDR                    NA         NA         NA       NA    
## RIDSTATR:`nhanes1518$BMXBMI`         NA         NA         NA       NA    
## RIAGENDR:`nhanes1518$BMXBMI`  2.472e-18  7.725e-18  3.200e-01   0.7490    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.843e-15 on 11080 degrees of freedom
##   (752 observations deleted due to missingness)
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.834e+33 on 15 and 11080 DF,  p-value: < 2.2e-16

7. Categorical Variable

nhanes_na_removed<-cbind(nhanes1518[1:5],nhanes1518$BMXBMI, nhanes1518$INDHHIN2)
nhanes_income <- dplyr::rename(nhanes_na_removed,income = "nhanes1518$INDHHIN2")%>% na.omit()
nhanes_income <- dplyr::rename(nhanes_income,BMI = "nhanes1518$BMXBMI")
# turn quantitative variable into categorical variable
nhanes_income$income<-as.character(nhanes_income$income)

Now we explore the effect of income as a categorical predictor variable on the response BMI. Refer to the website for encoding of income categories:

https://wwwn.cdc.gov/nchs/nhanes/2011-2012/demo_g.htm#INDHHIN2

We first drop the categories with values 77 (Refused) and 99 (Don’t Know):

nhanes_income <- subset(nhanes_income, income!="77" & income!="99")

Then we fit the linear regression on categorical variable income:

summary(lm(BMI ~ income, data = nhanes_income))
## 
## Call:
## lm(formula = BMI ~ income, data = nhanes_income)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.053  -5.097  -1.153   3.796  55.708 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.08676    0.42827  67.917  < 2e-16 ***
## income10     0.25666    0.53138   0.483  0.62910    
## income12     0.07855    0.56365   0.139  0.88917    
## income13     1.15974    0.72021   1.610  0.10737    
## income14     0.35724    0.48585   0.735  0.46218    
## income15    -0.60605    0.46019  -1.317  0.18788    
## income2      1.16654    0.57274   2.037  0.04170 *  
## income3      0.65874    0.52391   1.257  0.20866    
## income4      0.39053    0.51242   0.762  0.44599    
## income5      0.61757    0.51184   1.207  0.22763    
## income6      0.91220    0.47882   1.905  0.05680 .  
## income7      0.86657    0.48244   1.796  0.07249 .  
## income8      0.96502    0.49816   1.937  0.05275 .  
## income9      1.40552    0.51288   2.740  0.00615 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.255 on 10176 degrees of freedom
## Multiple R-squared:  0.006989,   Adjusted R-squared:  0.00572 
## F-statistic: 5.509 on 13 and 10176 DF,  p-value: 4.42e-10

Baseline: income category 1 corresponding to a household income of 0 to 4,999 dollars.

Model Interpretation:

  • Intercept: The intercept 29.0868 means that for people in the baseline income category (income category 1, corresponding to a household income of 0 to 4,999 dollars), the BMI is expected to be 29.0868 on average.

  • income6: The coefficient for the predictor income6 is 0.9122, which means that for participants with household income in category 6 (25,000 to 34,999 dollars per year), the BMI is expected to be 0.9122 higher than that of participants in the baseline category 1 (0 to 4,999 dollars), on average.
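
Behind the scenes, R expands a categorical (factor or character) predictor into indicator (dummy) variables, using the first level as the baseline represented by the intercept; relevel() changes which level serves as the baseline. A sketch with an illustrative three-level factor grp (not from the NHANES data):

```r
# Illustrative categorical predictor with three levels
set.seed(6)
grp <- factor(sample(c("low", "mid", "high"), 120, replace = TRUE),
              levels = c("low", "mid", "high"))
y <- 20 + 2 * (grp == "mid") + 5 * (grp == "high") + rnorm(120)

fit <- lm(y ~ grp)
names(coef(fit))   # "(Intercept)" "grpmid" "grphigh": "low" is the baseline

# Change the baseline level to "mid"
fit2 <- lm(y ~ relevel(grp, ref = "mid"))
```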